Abstract



The exponential family of models is defined in a general setting, not relying on probability 
theory. Some results of information geometry are shown to remain valid. Exponential families 
both of classical and of quantum mechanical statistical physics fit into the new formalism. 
Other less obvious applications are predicted. For instance, quantum states can be modeled 
as points in a classical phase space and the resulting model belongs to the exponential family. 

1 Introduction 

The exponential family of statistical models is an important notion in statistics. The parametrized
statistical model $\theta \in \mathbb{R}^n \mapsto p_\theta(a)$ belongs to the exponential family [1] if there exist functions
$\alpha(\theta)$, $c(a)$, and $H_j(a)$, $j = 1, 2, \dots, n$, such that the probability distributions $p_\theta(a)$ can be
written as

$$p_\theta(a) = c(a) \exp\Big(-\alpha(\theta) - \sum_{j=1}^n \theta_j H_j(a)\Big). \tag{1}$$

The choice of signs conforms with the conventions of statistical physics where the Boltzmann-Gibbs
probability distribution is usually written as

$$p_\beta(a) = \frac{c(a)}{Z(\beta)} e^{-\beta H(a)}. \tag{2}$$

This distribution is parametrized by the inverse temperature $\beta$ and clearly belongs to the
exponential family. The function $H(a)$ is called the Hamiltonian, the normalization $Z(\beta)$ is called
the partition sum. The function $c(a)$ is a prior weight. In many cases it is identically equal to 1.
But if, for instance, the underlying measure space $A$ is the set of non-negative integers $\mathbb{N}$, then
$c(a) = 1/a!$ might be an appropriate choice.
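
As a small numerical illustration of this remark (a sketch under stated assumptions, not part of the original argument): taking the non-negative integers as measure space, the prior weight $c(a) = 1/a!$ and the hypothetical Hamiltonian $H(a) = a$, the Boltzmann-Gibbs form reduces to the Poisson distribution with mean $e^{-\beta}$ and partition sum $\exp(e^{-\beta})$. The function name and the truncation `a_max` are choices made here for illustration only.

```python
import math

def boltzmann_gibbs_poisson(beta, a_max=60):
    """p_beta(a) = c(a) exp(-beta H(a)) / Z(beta) with c(a) = 1/a! and H(a) = a.

    The sum over the non-negative integers is truncated at a_max
    (assumption: the tail is negligible for the beta used here).
    """
    weights = [math.exp(-beta * a) / math.factorial(a) for a in range(a_max + 1)]
    Z = sum(weights)                      # partition sum
    return [w / Z for w in weights], Z

beta = 0.7
p, Z = boltzmann_gibbs_poisson(beta)

lam = math.exp(-beta)                     # mean of the corresponding Poisson law
poisson = [math.exp(-lam) * lam**a / math.factorial(a) for a in range(len(p))]
max_diff = max(abs(x - y) for x, y in zip(p, poisson))
```

Here `max_diff` measures the largest pointwise deviation between the Boltzmann-Gibbs probabilities and the Poisson probabilities.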

Recently, generalizations of the notion of an exponential family have been introduced
[2, 3, 4, 5, 6, 7, 8, 9, 10]. They provide a solid theoretical underpinning for research in non-extensive
statistical physics [11, 12]. The equilibrium probability distributions (pdfs) studied in this context
are related to Amari's $\alpha$-family of pdfs [13]. The latter is the subject of research in information
geometry [14], where techniques from differential geometry are applied to probability theory.




The present work has been inspired by the efforts of Topsøe [15, 16] to formulate the notion
of an exponential family in an abstract setting of game theory. One of his goals is to formulate
information theory without involving statistics. From Topsøe's work we quote: "In 1983 Kolmogorov stated
that 'Information theory must precede probability theory and not be based on it'." A seminal paper
in this direction is the work of Csiszár [17]. The settings of this paper can be reformulated in the
terminology used in the present work. More recent contributions in the area of machine learning
are found in [18, 19].

The next Section introduces the abstract settings of the formalism. In Section 3 the notion of 
Entropy is added. Section 4 gives a definition of an exponential family of models. Section 5 shows 
that both the standard and the quantum mechanical notions of an exponential family fit into the 
present formalism. The final Section formulates some conclusions. 

2 Data set models 

2.1 The information framework 

The elements of our framework are 

The space of data sets $X$ is an abstract topological space. Following Topsøe [15, 16] an element
$x$ of $X$ can be called a truth. However, it is closer to the tradition of probability theory
to consider the space of possible outcomes of an experiment. Therefore we refer to $x$ as a
data set. In the probabilistic formulation of information theory $X$ is the space of probability
distributions over a finite alphabet $A$. In the quantum mechanical context it is the space of
quantum states, for instance described by normalized wave functions or by density operators.
Other examples are given in what follows.

The space of questions $Q$ is a dual space of $X$. Each question $q$ is a real function continuously
defined on an open subset of $X$. The evaluation of $q$ in the point $x$ is the answer to the
question and is denoted $\langle x|q\rangle$ instead of $q(x)$ to stress that the space of questions is a
linear space but not necessarily an algebra with the usual pointwise product. For instance,
each hermitian bounded operator $A$ on the Hilbert space of wavefunctions $\psi$ determines an
everywhere defined continuous function, given by

$$\psi \mapsto \langle\psi|A\rangle = (\psi, A\psi). \tag{3}$$

Here, $(\phi, \psi)$ is the scalar product of two elements $\phi, \psi$ of the Hilbert space
$\mathcal{L}^2(\mathbb{R}^n, \mathbb{C})$. Note that we follow the notational conventions of the physics literature. In the
case of an unbounded operator, such as the position operators or many of the Hamilton operators, some
caution is needed. One must select a topology which makes (3) continuous on the domain
of definition of the operator.

2.2 What is a model?

In statistical physics a model is determined by its Hamiltonian. In the present context this is
replaced by one or more questions. However, we make the notion slightly more general
with the following definition.

Definition 1 A data set model is a topological manifold¹ $\mathbb{M}$ together with a continuous map $\mu$
defined on an open subset of the space $X$ of data sets taking values in $\mathbb{M}$.

Clearly, a set of questions $q_1, \dots, q_n$ with a common open domain of definition $D$ defines a
manifold $\mathbb{M} \subset \mathbb{R}^n$ as the range of the map $\mu$ defined by $\mu(x) = U$ when
$U_j = \langle x|q_j\rangle$, $j = 1, 2, \dots, n$, provided that the set $\mu(D)$ is open in $\mathbb{R}^n$.

The converse is also true. Indeed, one has

Proposition 1 A local parametrization $U \in D \subset \mathbb{R}^n \mapsto m_U \in \mathbb{M}$ of the manifold $\mathbb{M}$ of a
data set model $\mathbb{M}, \mu$ defines questions $q_j$ by $\langle x|q_j\rangle = U_j$ when $\mu(x) = m_U$.



Proof

The questions are well-defined. The domain of definition is the set of $x$ for which $\mu(x)$ belongs to
the range of the map $U \in D \mapsto m_U \in \mathbb{M}$. This is an open set because any homeomorphism is an
open map and $\mu$ is continuous. The parametrization is also bijective so that there is a unique $U$ such
that $m_U = \mu(x)$. Hence, the answer to the questions $q_j$ is unique.

The map $x \mapsto \langle x|q_j\rangle = U_j$ is continuous because $\mu$ is continuous and
$U \in D \mapsto m_U \in \mathbb{M}$ is open.

□ 

The advantage of defining a model in terms of manifolds is that the dependence on a specific 
choice of questions has been eliminated. 

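Example 1 The Euclidean space $X = \mathbb{R}^3$ is a space of data sets. The unit sphere

$$S_2 = \{x \in \mathbb{R}^3 : |x| = 1\} \tag{4}$$

is a model embedded in $\mathbb{R}^3$. The map $\mu$ is defined on $\mathbb{R}^3 \setminus \{0\}$ by $\mu(x) = x/|x|$.
The questions $q_1$ and $q_2$ defined for $x_3 > 0$ by

$$\langle x|q_1\rangle = \frac{x_1}{x_3} \quad\text{and}\quad \langle x|q_2\rangle = \frac{x_2}{x_3} \tag{5}$$

determine a parametrization of the northern hemisphere of $S_2$. It is given by

$$U \mapsto x_U = (U_1 x_3, U_2 x_3, x_3)^T \quad\text{with}\quad x_3 = \frac{1}{\sqrt{1 + U_1^2 + U_2^2}}. \tag{6}$$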
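
The relation between the model map, the two questions and the parametrization of Example 1 can be checked numerically. The following is a minimal sketch; the function names `mu`, `questions` and `parametrization` and the sample point are chosen here for illustration and do not appear in the original text.

```python
import math

def mu(x):
    """Model map: radial projection of a nonzero point of R^3 onto the unit sphere."""
    r = math.sqrt(sum(c * c for c in x))
    return tuple(c / r for c in x)

def questions(x):
    """Answers (U1, U2) = (x1/x3, x2/x3), only defined for x3 > 0."""
    x1, x2, x3 = x
    if x3 <= 0:
        raise ValueError("questions are only defined for x3 > 0")
    return (x1 / x3, x2 / x3)

def parametrization(U):
    """Point of the northern hemisphere with coordinates U = (U1, U2)."""
    U1, U2 = U
    x3 = 1.0 / math.sqrt(1.0 + U1 * U1 + U2 * U2)
    return (U1 * x3, U2 * x3, x3)

x = (0.3, -1.2, 2.0)          # an arbitrary data set with x3 > 0
U = questions(x)              # answers to the two questions
m = mu(x)                     # model point obtained by projection
x_U = parametrization(U)      # model point obtained from the answers
```

Both routes lead to the same point of the sphere: the answers to the questions depend on the data set only through its projection.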



3 Maximum entropy principle 
3.1 Entropy functions 

The amount of information contained in the data set $x$ is given by its entropy $S(x)$. It is a lower
semi-continuous function² with values in the extended reals $[-\infty, +\infty]$. Usually the entropy is
assumed to be concave. However, in general the space $X$ does not have an affine structure. On
the other hand, models are manifolds. Hence, by transferring the notion of entropy to the model
points the concavity as a function of parameters can be discussed.

¹ $\mathbb{M}$ is locally Euclidean: in each point $m$ of $\mathbb{M}$ there exist an integer $n > 0$, an open set $D$ of
$\mathbb{R}^n$, and a map $U \in D \mapsto m_U \in \mathbb{M}$ which is a homeomorphism between $D$ and a neighbourhood of $m$.
² We do not use this property in the present paper.

Given a data set model $\mathbb{M}, \mu$ the entropy $S(m)$ of a model point $m$ is defined by the maximum
entropy principle of Jaynes [21]

$$S(m) = \sup\{S(x) : \mu(x) = m\} \le +\infty. \tag{7}$$

If $m$ is not in the range of $\mu$ then $S(m) = -\infty$ is chosen. Note that we use here the map $\mu$ as a
constraint on the data sets involved in the maximization procedure, instead of using a specific set
of questions $q_1, \dots, q_n$.

Since $\mathbb{M}$ is a manifold we can now investigate whether local parametrizations $U \mapsto m_U$ exist
such that $S(m_U)$ is a concave function of the parameters $U$. In what follows the notation $S(U) =
S(m_U)$ will be used. Note that $S(U)$ depends on the choice of local parametrization while $S(m)$
is independent of parametrization.

Proposition 2 Let $U \in D \subset \mathbb{R}^n \mapsto m_U$ be a local parametrization of a data set model $\mathbb{M}, \mu$. Let
$q_1, \dots, q_n$ be the accompanying set of questions as defined by Proposition 1. Then one has locally

$$S(U) = \sup\{S(x) : \langle x|q_j\rangle = U_j \text{ for } j = 1, 2, \dots, n\} \le +\infty. \tag{8}$$

The proof of this result is straightforward.

Example 2 Consider the parametrization of the northern hemisphere of the unit sphere, as
discussed before. The entropy function

$$S(x) = -1 - |x|(\ln|x| - 1) \tag{9}$$

is maximal when $|x| = 1$. The entropy function $S(m)$ vanishes on the model manifold.

3.2 Perfect data sets 

In the example of the sphere the supremum in (7) is actually a maximum. The entropy function
$S(x)$ takes on its maximal value for the points of $S_2$. It is then natural to call these points
perfect data sets. Such privileged data points do not always exist. For instance, the model for a
quantum particle can be a point particle localized at a position $q$ in $\mathbb{R}^n$. The map $\mu$ is defined
by $\mu(\psi) = (\psi, Q\psi)$, with $Q$ the position operator. But there are no quadratically integrable
wavefunctions which describe a quantum particle perfectly localized at the position $q$. In such a case
one expects an entropy function $S(\psi)$ which is such that no maximum is attained for any wave function $\psi$.

The relation between model points and perfect data sets may be a one-to-many relation. This 
is made clear in the following example. 

Example 3 In the case of linear regression a data set consists of a finite sequence of pairs of real
numbers

$$(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n), \tag{10}$$

with at least two distinct pairs. The model space consists of straight lines not parallel to the
$y$-axis. A data set is perfect if the data points fall on a single line. But to a single straight line
correspond many perfect data sets. See Figure 1.

Figure 1: Embedding of the model into the space of data sets (labels in the figure: space of data sets, model, $y = ax + b$).

The interesting questions are given by



with $Z = \sum_{i,j}(x_i - x_j)^2$. They are only defined on data sets for which $Z \neq 0$. They are interesting
because they return the parameters $a$ and $b$ of the fitted line $y = ax + b$. These two questions
uniquely determine the model. A meaningful entropy function is



Its value on perfect data sets is $-a^2 - b^2$. For other data sets one has $S(x) < S(\mu(x))$.
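
The properties stated in Example 3 can be illustrated with a short sketch. Two assumptions are made here and are not taken from the original text: the two questions are replaced by the usual least-squares slope and intercept, and the entropy is the hypothetical choice $-a^2 - b^2$ minus the residual sum of squares, which has the stated value on perfect data sets and is strictly smaller otherwise.

```python
def fit_line(data):
    """Least-squares slope a and intercept b of y = a x + b.

    Stand-in for the two questions of Example 3 (assumption: they coincide
    with the usual least-squares estimates).
    """
    n = len(data)
    x_mean = sum(x for x, _ in data) / n
    y_mean = sum(y for _, y in data) / n
    spread = sum((x - x_mean) ** 2 for x, _ in data)   # zero iff all x coincide
    if spread == 0:
        raise ValueError("questions undefined: all x values coincide")
    a = sum((x - x_mean) * (y - y_mean) for x, y in data) / spread
    b = y_mean - a * x_mean
    return a, b

def entropy(data):
    """Hypothetical entropy: -a^2 - b^2 minus the residual sum of squares."""
    a, b = fit_line(data)
    rss = sum((y - a * x - b) ** 2 for x, y in data)
    return -a * a - b * b - rss

perfect = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]   # all points on y = 2x + 1
noisy = [(0.0, 1.0), (1.0, 3.5), (2.0, 5.0)]     # middle point moved off the line

a_n, b_n = fit_line(noisy)
model_entropy_noisy = -a_n ** 2 - b_n ** 2       # entropy of the fitted model point
```

For the perfect data set the entropy equals $-a^2 - b^2$; for the noisy one it lies strictly below the value at the fitted model point.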

4 Exponential families 

The notion of an exponential family of models is strongly related to the concept of canonical 
parametrizations. These are introduced now. 

4.1 Contact transforms 

In thermodynamics, the Massieu function $\Phi(\theta)$ is the Legendre transform of the entropy $S(U)$.
This inspires the following definition.

Definition 2 Let be given a local parametrization $U \in D \subset \mathbb{R}^n \mapsto m_U$ of a data set model $\mathbb{M}, \mu$.
Assume that the model entropy $S(U)$ is locally finite. Then the Massieu function is defined by

$$\Phi(\theta) = \sup\Big\{S(U) - \sum_{j=1}^n \theta_j U_j : U \in D\Big\}. \tag{12}$$

Theorem 1 Let be given a local parametrization $U \in D \subset \mathbb{R}^n \mapsto m_U$ of a data set model $\mathbb{M}, \mu$.
Let $q_1, \dots, q_n$ be the accompanying set of questions defined by Proposition 1. Assume that the
model entropy $S(U)$ is locally finite. Then one has

$$\Phi(\theta) = \sup\Big\{S(x) - \sum_{j=1}^n \theta_j \langle x|q_j\rangle : \mu(x) \text{ is local}\Big\}. \tag{13}$$

$\Phi(\theta)$ is a convex function. In particular, it is finite on a convex subset $\Theta$ of $\mathbb{R}^n$.



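Proof

Remember that the questions are such that $\mu(x) = m_U$ holds if and only if $\langle x|q_j\rangle = U_j$ for
$j = 1, 2, \dots, n$. Take $x$ so that $\mu(x)$ is local. Then one has $\mu(x) = m_U$ with $U \in D$. But
$S(U) < +\infty$ implies that $S(x) < +\infty$. Hence one has

$$S(x) - \sum_{j=1}^n \theta_j\langle x|q_j\rangle \le S(U) - \sum_{j=1}^n \theta_j U_j \le \Phi(\theta). \tag{14}$$

On the other hand, if $\Phi(\theta) < +\infty$ then for any $\epsilon > 0$ there exists $U$ such that

$$\Phi(\theta) - \epsilon < S(U) - \sum_{j=1}^n \theta_j U_j. \tag{15}$$

Similarly, there exists $x$, satisfying $\langle x|q_j\rangle = U_j$ for $j = 1, 2, \dots, n$, such that

$$S(U) - \epsilon < S(x). \tag{16}$$

All together one has

$$\Phi(\theta) - 2\epsilon < S(x) - \sum_{j=1}^n \theta_j U_j. \tag{17}$$

Since $\epsilon > 0$ is arbitrary one concludes that the equality holds in (13).

Finally, if $\Phi(\theta) = +\infty$ then there exists $U$ such that $S(U) - \sum_{j=1}^n \theta_j U_j$ is arbitrarily large. But
then there exists $x$ such that $\mu(x)$ is local and $S(x) - \sum_{j=1}^n \theta_j\langle x|q_j\rangle$ is arbitrarily large. Hence, also
in this case the equality holds in (13).

The convexity statement is easy to show. Let $\lambda$ in $[0, 1]$. One can assume that $\Phi(\theta_1)$ and $\Phi(\theta_2)$
are finite because otherwise the convexity statement is empty. Then for any $x$ with local $\mu(x)$ one
has

$$\begin{aligned}
S(x) - \sum_j[\lambda\theta_{1,j} + (1 - \lambda)\theta_{2,j}]\langle x|q_j\rangle
&= \lambda\Big[S(x) - \sum_j\theta_{1,j}\langle x|q_j\rangle\Big] + (1 - \lambda)\Big[S(x) - \sum_j\theta_{2,j}\langle x|q_j\rangle\Big] \\
&\le \lambda\Phi(\theta_1) + (1 - \lambda)\Phi(\theta_2).
\end{aligned} \tag{18}$$

This implies $\Phi(\lambda\theta_1 + (1 - \lambda)\theta_2) \le \lambda\Phi(\theta_1) + (1 - \lambda)\Phi(\theta_2)$.

□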



In the physics literature one is used to working with the free energy rather than with Massieu's
function. If the inverse temperature $\beta$ is the only parameter then the free energy equals
$-\Phi(\beta)/\beta$ and is the infimum of $\langle x|q\rangle - S(x)/\beta$.
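
The contact transform and its convexity can be explored numerically in the one-parameter case. The sketch below approximates the supremum on a grid; the quadratic model entropy $S(U) = -U^2/2$ and all names are hypothetical choices made for illustration, for which the exact transform is $\theta^2/2$.

```python
import numpy as np

def massieu(theta, U_grid, S_grid):
    """Grid approximation of Phi(theta) = sup_U { S(U) - theta * U }."""
    return float(np.max(S_grid - theta * U_grid))

# Hypothetical model entropy S(U) = -U^2 / 2 sampled on a wide grid
U_grid = np.linspace(-10.0, 10.0, 20001)
S_grid = -0.5 * U_grid**2

thetas = np.linspace(-3.0, 3.0, 13)
phi = np.array([massieu(t, U_grid, S_grid) for t in thetas])

# Midpoint convexity: Phi((s + t) / 2) <= (Phi(s) + Phi(t)) / 2
mid = np.array([massieu(0.5 * (s + t), U_grid, S_grid)
                for s, t in zip(thetas[:-1], thetas[1:])])
convex_ok = bool(np.all(mid <= 0.5 * (phi[:-1] + phi[1:]) + 1e-12))
```

Because the grid approximation is a maximum of finitely many functions that are affine in $\theta$, it is convex by construction, which mirrors the argument used in the proof.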
4.2 Canonical parametrization



Let us now return to a data set model with a locally defined parametrization. Then the Legendre- 
Fenchel transform can be used to introduce a canonical parametrization. The attribute 'canonical' 
refers to the canonical ensemble of statistical physics. In the context of the exponential family one 
speaks about the canonical form of the probability distribution. But in the present approach the 
canonical parametrization is defined before introducing the exponential family and is independent 
of it. 

Definition 3 Let be given some local parametrization $\theta \in \Theta \subset \mathbb{R}^n \mapsto m_\theta$ of a data set model
$\mathbb{M}, \mu$. The parametrization is said to be canonical if there exists another local parametrization
$U \in D \mapsto m_U$ such that

• $S(U) < +\infty$ for all $U$ in $D$;

• The relation $m_\theta = m_U$ defines a diffeomorphism between $\Theta$ and $D$;

• Under this diffeomorphism one has

$$\Phi(\theta) - S(U) + \sum_{j=1}^n \theta_j U_j = 0. \tag{19}$$

To make the distinction between the two parametrizations $\theta \in \Theta \subset \mathbb{R}^n \mapsto m_\theta$ and
$U \in D \subset \mathbb{R}^n \mapsto m_U$ we call the latter the associated energy parametrization. The motivation is that in
statistical physics the components of $U$ have the meaning of energies.

Theorem 2 If the parametrization $\theta \in \Theta \mapsto m_\theta$ of a data set model $\mathbb{M}, \mu$ is canonical then
the Massieu function $\Phi(\theta)$ is a strictly convex differentiable function and there exist questions
$q_1, \dots, q_n$ satisfying

$$\frac{\partial}{\partial\theta_j}\Phi(\theta) = -\langle x|q_j\rangle \quad\text{for all } x \text{ satisfying } \mu(x) = m_\theta. \tag{20}$$

Proof

Let $U \in D \subset \mathbb{R}^n \mapsto m_U$ be the local parametrization appearing in the definition of a canonical
parametrization. Note that

$$\zeta \mapsto \Phi(\theta) - \sum_{j=1}^n U_j(\zeta_j - \theta_j) \tag{21}$$

is a tangent plane in the point $\theta$. The requirement that $m_U = m_\theta$ determines a diffeomorphism
implies that a small change of $\theta$ corresponds with a small change of $U$ and hence a small change
in the slope of the tangent plane. This proves that the tangent plane is unique. One concludes
that $\Phi(\theta)$ is differentiable and that

$$\frac{\partial\Phi}{\partial\theta_j} = -U_j. \tag{22}$$

The strict convexity follows because the correspondence $\theta \leftrightarrow U$ is bijective.

Let $q_1, \dots, q_n$ be the questions defined in Proposition 1. They satisfy $\langle x|q_j\rangle = U_j$ for
$j = 1, 2, \dots, n$ when $\mu(x) = m_U$. Hence the statement of the Theorem follows.

□

The second derivatives of $\Phi(\theta)$ define a metric tensor

$$g_{j,k}(\theta) = \frac{\partial^2\Phi}{\partial\theta_j\partial\theta_k} = -\frac{\partial U_j}{\partial\theta_k}. \tag{23}$$

This matrix is a generalization of Fisher's information matrix.

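Example 4 Let $X$ be the set of all 2-by-2 density operators (these are positive trace class operators
with trace equal to 1). The entropy function is the von Neumann entropy

$$S(\rho) = -\operatorname{Tr}\rho\ln\rho. \tag{24}$$

The model $\mathbb{M}$ coincides with the space of data sets $X$. Let us calculate a parametrization which is
canonical.

Three questions are needed to determine uniquely a density operator $\rho$. In terms of the three
Pauli matrices $\sigma_j$ these are

$$\langle\rho|q_j\rangle = \operatorname{Tr}\rho\sigma_j, \quad j = 1, 2, 3. \tag{25}$$

Then one can write, with $U_j = \langle\rho|q_j\rangle$,

$$\rho = \frac{1}{2}\Big(\mathbb{I} + \sum_{j=1}^3 U_j\sigma_j\Big). \tag{26}$$

The von Neumann entropy becomes

$$S(\rho) = \ln 2 - \frac{1}{2}(1 + |U|)\ln(1 + |U|) - \frac{1}{2}(1 - |U|)\ln(1 - |U|). \tag{27}$$

The Massieu function reads

$$\Phi(\theta) = \sup_U\Big\{S(U) - \sum_{j=1}^3 \theta_j U_j : |U| < 1\Big\}. \tag{28}$$

The maximum is reached when

$$\theta_j = -\frac{U_j}{2|U|}\ln\frac{1 + |U|}{1 - |U|}. \tag{29}$$

Note that this implies that $|U| = \tanh|\theta|$. Hence the inverse relation is

$$U_j = -\frac{\theta_j}{|\theta|}\tanh|\theta|. \tag{30}$$

One concludes that the map $U \mapsto \theta$ is a diffeomorphism from the interior of the unit sphere onto
$\mathbb{R}^3$. $\rho_\theta$ can now be written as

$$\rho_\theta = \frac{1}{2}\mathbb{I} - \frac{1}{2|\theta|}\tanh|\theta|\sum_{j=1}^3 \theta_j\sigma_j
= \frac{1}{2\cosh(|\theta|)}\exp\Big(-\sum_{j=1}^3 \theta_j\sigma_j\Big). \tag{31}$$

This is a canonical parametrization of the 2-by-2 density matrices.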
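
The formulas of this example lend themselves to a numerical check. The sketch below (function names and the sample value of $\theta$ are illustrative assumptions) builds the density matrix once from the energy coordinates $U$ and once from the canonical coordinates $\theta$, using an eigendecomposition for the matrix exponential, and compares the von Neumann entropy with the closed-form expression in terms of $|U|$.

```python
import numpy as np

SIGMA = np.array([[[0, 1], [1, 0]],
                  [[0, -1j], [1j, 0]],
                  [[1, 0], [0, -1]]], dtype=complex)   # Pauli matrices

def rho_from_U(U):
    """Density matrix with Tr(rho sigma_j) = U_j, for |U| < 1."""
    return 0.5 * (np.eye(2) + np.einsum("j,jab->ab", U, SIGMA))

def rho_from_theta(theta):
    """Canonical form exp(-sum_j theta_j sigma_j) / (2 cosh|theta|)."""
    A = np.einsum("j,jab->ab", theta, SIGMA)           # Hermitian 2x2 matrix
    w, V = np.linalg.eigh(A)
    exp_minus_A = (V * np.exp(-w)) @ V.conj().T        # exp(-A) via eigendecomposition
    return exp_minus_A / (2.0 * np.cosh(np.linalg.norm(theta)))

def von_neumann_entropy(rho):
    w = np.linalg.eigvalsh(rho)
    w = w[w > 1e-15]                                   # convention 0 ln 0 = 0
    return float(-np.sum(w * np.log(w)))

theta = np.array([0.4, -0.3, 1.1])
t = np.linalg.norm(theta)
U = -theta / t * np.tanh(t)                            # energy coordinates from theta
u = np.linalg.norm(U)
S_formula = np.log(2) - 0.5 * (1 + u) * np.log(1 + u) - 0.5 * (1 - u) * np.log(1 - u)
Phi = S_formula - theta @ U                            # Massieu function at theta
```

With these values `Phi` agrees with $\ln 2\cosh|\theta|$, the value implied by the normalization of the exponential form.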
4.3 Dual Relations



Let be given a canonical parametrization $\theta \mapsto m_\theta$ of model $\mathbb{M}, \mu$, together with the associated
energy parametrization $U \mapsto m_U$. From (19) and (20) then follows the pair of dual relations

$$\frac{\partial\Phi}{\partial\theta_j} = -U_j \quad\text{and}\quad \frac{\partial S}{\partial U_j} = \theta_j, \tag{32}$$

where $U \mapsto \theta$ is the diffeomorphism determined by the relation $m_U = m_\theta$.

The function $S(U)$ is strictly concave. This follows because the matrix of second derivatives of
$S(U)$ equals minus the inverse of the metric tensor $g_{j,k}(\theta)$ defined by (23). The latter is positive
definite because by Theorem 2 the Massieu function is strictly convex.

If the metric tensor $g_{j,k}(\theta)$ is sufficiently smooth then the model space $\mathbb{M}$ is (locally) a
Riemannian manifold with respect to each of the two parametrizations. They are dual to each other
in the sense that the metric tensor of one parametrization is the inverse of that of the other. The
curvature of the manifold in the Levi-Civita connection vanishes because the metric tensor is the
matrix of second derivatives of a convex function. Hence the manifold is flat.

4.4 Logarithmic maps 

Definition 4 A logarithmic map $L$ maps model points onto questions.

For instance, the Boltzmann-Gibbs-Shannon entropy $S(p)$ can be written as the average of the
measurable quantity $-\ln p(z)$. The probability distribution $p$ belongs to the space of data sets $X$.
But $-\ln p(z)$ is used as a question, the answer of which is the value of the entropy function $S(p)$.
In this example the logarithmic map is defined on all data sets. But we need it further on only
for perfect data sets or for model points.

The logarithmic map $L$ can be used to define a divergence or relative entropy between data
sets and model points.

Definition 5 The divergence of a data set $x$ from a model point $m$ is given by

$$D(x\|m) = \sup\{S(y) + \langle y|Lm\rangle : \mu(y) = m\} - S(x) - \langle x|Lm\rangle. \tag{33}$$

Clearly, if $\mu(x) = m$ then $D(x\|m) \ge 0$ with equality if and only if $x$ maximizes $S(x) + \langle x|Lm\rangle$
under the constraint $\mu(x) = m$. We call such $x$ canonical data sets.



4.5 Exponential families 

In the previous subsection the notion of a logarithmic map was introduced to prepare for the
definition of the exponential family.

Definition 6 A model $\mathbb{M}, \mu$ with logarithmic map $L$ belongs to the exponential family of data
set models if the model space $\mathbb{M}$ is covered with local parametrizations $\theta \in \Theta \mapsto m_\theta$, which are
canonical, and the associated energy parametrizations $U \in D \subset \mathbb{R}^n \mapsto m_U$ are such that

$$Lm_\theta = \alpha(\theta) - \sum_j \theta_j q_j \quad\text{for all } \theta \in \Theta, \tag{34}$$

where the questions $q_j$ are defined by $\langle x|q_j\rangle = U_j$ when $\mu(x) = m_U$ (see Proposition 1).
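In the example of the 2-by-2 density matrices (see (31)) one has

$$\ln\rho_\theta = -\ln 2\cosh(|\theta|) - \sum_j \theta_j\sigma_j. \tag{35}$$

Hence the model belongs to the exponential family. One has $\alpha(\theta) = -\ln 2\cosh(|\theta|)$. The questions
$q_j$ are given by (25).

The property (34) can be used to simplify the Definition 5 of divergence. One obtains

$$\begin{aligned}
D(x\|m_\theta) &= \sup\Big\{S(y) + \Big\langle y\Big|\alpha(\theta) - \sum_j \theta_j q_j\Big\rangle : \mu(y) = m_\theta\Big\}
 - S(x) - \Big\langle x\Big|\alpha(\theta) - \sum_j \theta_j q_j\Big\rangle \\
&= \sup\Big\{S(y) - \Big\langle y\Big|\sum_j \theta_j q_j\Big\rangle : \mu(y) = m_\theta\Big\} - S(x) + \sum_j \theta_j\langle x|q_j\rangle \\
&= \Phi(\theta) - S(x) + \sum_j \theta_j\langle x|q_j\rangle.
\end{aligned} \tag{36}$$

From Theorem 1 now follows that $D(x\|m_\theta) \ge 0$ for all $x$ for which $\mu(x)$ is local. Equality then
holds if and only if the data set is canonical.
Note that one can write, using (19),

$$D(x\|m_\theta) = S(U) - \sum_j \theta_j U_j - S(x) + \sum_j \theta_j\langle x|q_j\rangle. \tag{37}$$

If $\mu(x) = m_U$ then $\langle x|q_j\rangle = U_j$. Hence

$$D(x\|m_\theta) = S(U) - S(x) \ge 0 \quad\text{if } \mu(x) = m_U. \tag{38}$$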

Therefore, in the case of a model belonging to the exponential family, canonical data sets are 
perfect data sets as well. 

4.6 Pythagorean Theorems 

The model map $\mu$ can be seen as an orthogonal projection of $X$ onto the manifold $\mathbb{M}$. This is
supported by a Pythagorean theorem in which the divergence plays the role of a distance squared.
Introduce the divergence between two model points $m$ and $m'$ by

$$D(m\|m') = \inf\{D(x\|m') : \mu(x) = m\}. \tag{39}$$

The following result shows that this divergence is of the Bregman type [17, 20]. It has a nice
geometric interpretation. It is the difference between the value $\Phi(\zeta)$ of the Massieu function in
the point $\zeta$ and the value in $\zeta$ of the plane tangent in the point $\theta$.

Proposition 3 Let be given a model $\mathbb{M}, \mu$ with logarithmic map $L$ belonging to the exponential
family. Consider a local parametrization $\theta \in \Theta \mapsto m_\theta$ and the associated energy parametrization
$U \in D \subset \mathbb{R}^n \mapsto m_U$ as in the definition of the exponential family. Then one has

$$D(m_\theta\|m_\zeta) = \Phi(\zeta) - \Phi(\theta) + \sum_j (\zeta_j - \theta_j)U_j. \tag{40}$$
Proof

First calculate using (36)

$$\begin{aligned}
D(m_\theta\|m_\zeta) &= \inf\{D(y\|m_\zeta) : \mu(y) = m_\theta\} \\
&= \Phi(\zeta) - \sup\Big\{S(y) - \sum_j \zeta_j\langle y|q_j\rangle : \mu(y) = m_\theta\Big\}.
\end{aligned} \tag{41}$$

Now use that $\langle y|q_j\rangle$ is constant on the set of $y$ for which $\mu(y) = m_\theta$. Hence one has

$$D(m_\theta\|m_\zeta) = \Phi(\zeta) - S(U) + \sum_j \zeta_j U_j \tag{42}$$

with $U$ so that $m_U = m_\theta$. Using (19) this becomes (40).

□

The Pythagorean theorem [17] for the projection of an arbitrary data set $x \in X$ onto the
manifold $\mathbb{M}$ by means of the model map $\mu$ now follows readily. See Figure 2.

Figure 2: Projection of a data set $x$ onto the manifold $\mathbb{M}$ using the model map $\mu$.

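Theorem 3 Let be given a model $\mathbb{M}, \mu$ with logarithmic map $L$ belonging to the exponential family.
If $\mu(x) = m_\theta$ then

$$D(x\|m_\theta) + D(m_\theta\|m_\zeta) = D(x\|m_\zeta). \tag{43}$$

Proof

Use (36) and (40), together with $\langle x|q_j\rangle = U_j$, to obtain

$$D(x\|m_\theta) + D(m_\theta\|m_\zeta) = \Phi(\zeta) - S(x) + \sum_j \zeta_j\langle x|q_j\rangle = D(x\|m_\zeta). \tag{44}$$

This is (43).

□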

Following [in] , we can also formulate a Pythagorean theorem involving only model points. 
Theorem 4 Consider a model $\mathbb{M}, \mu$ with logarithmic map $L$ belonging to the exponential family.
Let $\theta \in \Theta \mapsto m_\theta$ and $U \in D \mapsto m_U$ be canonical and energy parametrizations as mentioned in
Definition 6. Let $\theta, \zeta, \eta$ be points in $\Theta$. Let $U$ and $V$ be dual coordinates such that $m_U = m_\theta$ and
$m_V = m_\zeta$. Assume that

$$\sum_j (\eta_j - \zeta_j)(U_j - V_j) = 0. \tag{45}$$

Then one has

$$D(m_\theta\|m_\zeta) + D(m_\zeta\|m_\eta) = D(m_\theta\|m_\eta). \tag{46}$$



Proof

This follows immediately from (40).

□



5 Applications 

We show below how the standard notion of an exponential family of statistical models fits into the
present formalism. Also the analogous notion in quantum statistics is discussed. The generalized
exponential families [2] introduced in the context of Tsallis' non-extensive statistical mechanics [12],
or even in a broader context, do fit as well, but will not be treated here.

5.1 Statistical models 

Here we show that the above framework is a generalization of the notion of the exponential family
of statistical models [1].

Let $X$ be the affine space of probability distributions over the discrete measure space $A$. Let $c(a)$
be a prior weight on $A$. Questions are real functions $f$ of $A$, seen as maps $p \mapsto E_p f = \sum_a p(a)f(a)$.
The answer to a question $f$, given $p$, is therefore given by

$$\langle p|f\rangle = E_p f. \tag{47}$$

The entropy function is that of Boltzmann-Gibbs-Shannon (BGS) and is given by

$$S(p) = -\langle p|L(p)\rangle = -\sum_a p(a)\ln\frac{p(a)}{c(a)}. \tag{48}$$

Let $\theta \in \Theta \mapsto p_\theta$ be a statistical model with probability distributions $p_\theta$ given by (1). For
convenience assume $c(a) = 1$ and introduce the notation $E_\theta = E_{p_\theta}$. Let $U_j(\theta) = E_\theta H_j$. The model
space $\mathbb{M}$ is the subset of $X$ given by

$$\mathbb{M} = \{p_\theta : \theta \in \Theta\}. \tag{49}$$

Introduce the model map $\mu$ by

$$\mu(p) = p_\theta \quad\text{if } E_p H_j = E_\theta H_j \text{ for } j = 1, \dots, n. \tag{50}$$
Assume for convenience that the functions $H_j(a)$ are bounded. Then the model map is everywhere
defined and continuous in the $\ell_1$-metric of $X$.

It is well-known that the probability distributions of a model belonging to the exponential
family maximise the BGS-entropy under the constraint $U_j(\theta) = E_\theta H_j$, $j = 1, \dots, n$; in our
terminology the $p_\theta$ are perfect data sets. Hence one has

$$S(\theta) = S(U) = S(p_\theta) = \alpha(\theta) + \sum_j \theta_j U_j(\theta). \tag{51}$$

In particular, there follows that $\Phi(\theta) = \alpha(\theta)$.

Generically, the relation between $U$ and $\theta$ is a diffeomorphism. Indeed, one has

$$\begin{aligned}
g_{j,k}(\theta) &= -\frac{\partial}{\partial\theta_j}\sum_a p_\theta(a)H_k(a) \\
&= \sum_a p_\theta(a)H_j(a)H_k(a) + \sum_a p_\theta(a)\frac{\partial\alpha}{\partial\theta_j}H_k(a) \\
&= E_\theta H_j H_k - (E_\theta H_j)(E_\theta H_k).
\end{aligned} \tag{52}$$

If no nontrivial linear combination of the Hamiltonians $H_j$ is a constant function then the matrix
$g_{j,k}(\theta)$ is positive definite. This implies that the relation between $U$ and $\theta$ is a diffeomorphism.

One concludes that the parametrization $\theta \mapsto p_\theta$ is canonical.

Introduce a logarithmic map $L$ by

$$(Lp_\theta)(a) = \ln p_\theta(a). \tag{53}$$

The corresponding divergence is

$$D(p\|p_\theta) = \sum_a p(a)\ln\frac{p(a)}{p_\theta(a)}. \tag{54}$$

This is the standard expression for the divergence/relative entropy.

It follows now from (51) that the model $\mathbb{M}, \mu$ with this logarithmic map belongs to the
exponential family provided that no nontrivial linear combination of the Hamiltonians $H_j$ is a constant function.
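
Two statements of this subsection can be verified numerically on a small finite alphabet: the metric tensor is the covariance matrix of the Hamiltonians, and the Pythagorean relation of Theorem 3 holds for the relative entropy. The sketch below uses randomly generated Hamiltonians on six points; these, the chosen parameter values and all names are hypothetical and serve only as an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(2, 6))            # hypothetical Hamiltonians H_1, H_2 on 6 points

def log_Z(theta):
    return float(np.log(np.sum(np.exp(-theta @ H))))

def p_theta(theta):
    w = np.exp(-theta @ H)
    return w / w.sum()

def kl(p, q):
    """Relative entropy sum_a p(a) ln(p(a)/q(a)) for strictly positive p, q."""
    return float(np.sum(p * np.log(p / q)))

theta = np.array([0.3, -0.8])
zeta = np.array([-0.5, 0.2])
p_t, p_z = p_theta(theta), p_theta(zeta)

# Covariance matrix of the Hamiltonians under p_theta
mean = H @ p_t
cov = (H * p_t) @ H.T - np.outer(mean, mean)

# Central finite-difference Hessian of Phi(theta) = ln Z(theta)
eps = 1e-4
E = np.eye(2) * eps
hess = np.array([[(log_Z(theta + E[j] + E[k]) - log_Z(theta + E[j] - E[k])
                   - log_Z(theta - E[j] + E[k]) + log_Z(theta - E[j] - E[k]))
                  / (4 * eps**2) for k in range(2)] for j in range(2)])

# A data set p with the same expectations as p_theta: perturb p_theta along a
# direction orthogonal to the constant function and to H_1, H_2
basis = np.vstack([np.ones(6), H])
v = rng.normal(size=6)
v -= np.linalg.pinv(basis) @ (basis @ v)       # remove components along the constraints
p = p_t + 0.5 * p_t.min() / np.abs(v).max() * v

pythagoras_gap = kl(p, p_t) + kl(p_t, p_z) - kl(p, p_z)
```

The perturbation keeps `p` strictly positive and normalized, so that `p` is a data set which the model map sends to `p_t`.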



5.2 Quantum statistical physics 

In quantum statistics the probability distributions of classical statistics are replaced by density
matrices/density operators on a separable Hilbert space. They form the space $X$ of data sets.
Questions are bounded operators on the Hilbert space. The evaluation function is

$$\rho \in X \mapsto \langle\rho|A\rangle = \operatorname{Tr}\rho A. \tag{55}$$

It is continuous for instance in the Hilbert-Schmidt norm. The entropy function is the von Neumann
entropy (24).

A quantum statistical model is a homeomorphism $\theta \in \Theta \subset \mathbb{R}^n \mapsto \rho_\theta$. The model space is
$\mathbb{M} = \{\rho_\theta : \theta \in \Theta\}$. The model belongs to the exponential family of quantum models if there exist
self-adjoint operators $H_1, \dots, H_n$ such that

$$\rho_\theta = \frac{1}{Z(\theta)}\exp\Big(-\sum_{j=1}^n \theta_j H_j\Big) \tag{56}$$

with $Z(\theta) = \operatorname{Tr}\exp(-\sum_{j=1}^n \theta_j H_j)$. The model map $\mu$ satisfies $\mu(\rho) = \rho_\theta$ if $\operatorname{Tr}\rho H_j$ is well-defined
and equals $U_j = \operatorname{Tr}\rho_\theta H_j$ for $j = 1, \dots, n$.

The $\rho_\theta$ of the form (56) maximize the von Neumann entropy under the constraint of a given
value of the $U_j$. The proof is based on Klein's inequality; see for instance [22, 23]. In particular
the $\rho_\theta$ are perfect data sets. One obtains

$$S(U) = S(\rho_\theta) = \Phi(\theta) + \sum_{j=1}^n \theta_j U_j, \qquad \Phi(\theta) = \ln Z(\theta). \tag{57}$$

One calculates

$$\begin{aligned}
g_{j,k}(\theta) &= -\frac{\partial}{\partial\theta_j}\operatorname{Tr}\rho_\theta H_k \\
&= \operatorname{Tr}\rho_\theta H_j H_k + \frac{\partial\ln Z}{\partial\theta_j}\operatorname{Tr}\rho_\theta H_k \\
&= \operatorname{Tr}\rho_\theta H_j H_k - (\operatorname{Tr}\rho_\theta H_j)(\operatorname{Tr}\rho_\theta H_k).
\end{aligned} \tag{58}$$

The eigenvalues of this matrix cannot be negative. If they are strictly positive for all $\theta$ then the
relation between $U$ and $\theta$ is a diffeomorphism and the parametrization $\theta \mapsto \rho_\theta$ is canonical.
Introduce the logarithmic map defined by $L\rho_\theta = \ln\rho_\theta$. One clearly has

$$L\rho_\theta = -\ln Z(\theta) - \sum_{j=1}^n \theta_j H_j. \tag{59}$$

Hence, the model belongs to the exponential family according to Definition 6. A short calculation
then yields

$$D(\rho\|\rho_\theta) = \operatorname{Tr}\rho(\ln\rho - \ln\rho_\theta). \tag{60}$$

This is the standard expression for relative entropy in quantum statistical physics [23].
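
The "short calculation" can be cross-checked numerically in finite dimension. The sketch below compares the abstract form of the divergence, $\Phi(\theta) - S(\rho) + \sum_j \theta_j \operatorname{Tr}\rho H_j$, with the standard quantum relative entropy. The random Hermitian matrices standing in for the Hamiltonians, the dimension and all names are assumptions made for illustration.

```python
import numpy as np

def herm_fn(A, f):
    """Apply a real function f to a Hermitian matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.conj().T

def random_hermitian(n, rng):
    M = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    return 0.5 * (M + M.conj().T)

rng = np.random.default_rng(1)
n = 4
H = [random_hermitian(n, rng) for _ in range(2)]   # hypothetical Hamiltonians H_1, H_2
theta = np.array([0.6, -0.4])

K = sum(t * Hj for t, Hj in zip(theta, H))         # sum_j theta_j H_j
exp_minus_K = herm_fn(K, lambda w: np.exp(-w))
Z = np.trace(exp_minus_K).real
rho_theta = exp_minus_K / Z                        # Gibbs state of the model
Phi = np.log(Z)                                    # Massieu function

# An arbitrary full-rank density matrix playing the role of a data set
B = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
rho = B @ B.conj().T
rho /= np.trace(rho).real

def entropy(r):
    """Von Neumann entropy of a full-rank density matrix."""
    w = np.linalg.eigvalsh(r)
    return float(-np.sum(w * np.log(w)))

# Divergence in the abstract form and in the standard form
D_abstract = Phi - entropy(rho) + sum(t * np.trace(rho @ Hj).real
                                      for t, Hj in zip(theta, H))
D_standard = np.trace(rho @ (herm_fn(rho, np.log) - herm_fn(rho_theta, np.log))).real
```

The two values coincide up to rounding, and both are non-negative in accordance with Klein's inequality.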



5.3 Coherent states 

Now we discuss an example which shows that our framework extends well beyond the (quantum) 
statistical context. We consider the phase space of classical mechanics as a model for a state space 
of quantum mechanical wave functions. 

For simplicity consider a quantum particle in one dimension. The space $X$ of data sets consists
of wave functions $\psi(x)$ which are twice differentiable and normalized so that

$$\int_{\mathbb{R}} dx\,|\psi(x)|^2 = 1. \tag{61}$$

Note that two wave functions $\psi(x)$ and $e^{i\alpha}\psi(x)$, with $\alpha$ a real constant, determine the same point of $X$.

Questions are linear operators $A$ acting on the Hilbert space of square integrable complex
functions. The evaluation function is given by

$$\langle\psi|A\rangle = \int_{\mathbb{R}} dx\,\overline{\psi(x)}\,(A\psi)(x). \tag{62}$$

Introduce position and momentum operators by $Q\psi(x) = x\psi(x)$ and $P\psi(x) = -i\hbar\,\partial\psi/\partial x$. Note
that these are unbounded operators. Hence we need a topology on $X$ which is such that the two
questions $\psi \mapsto \langle\psi|Q\rangle$ and $\psi \mapsto \langle\psi|P\rangle$ are continuous. Then they define a continuous map $\mu$ of $X$
into the model space $\mathbb{M} = \mathbb{R}^2$, which is the phase space of a particle in classical mechanics.
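Introduce now the entropy function

$$S(\psi) = \frac{1}{2}|\langle\psi|a\rangle|^2 - \langle\psi|a^\dagger a\rangle, \tag{63}$$

where the annihilation operator $a$ is defined by

$$a = \frac{1}{\sqrt{2}}\Big(\frac{1}{r}Q + i\frac{r}{\hbar}P\Big) \tag{64}$$

with $r$ and $\hbar$ positive constants. Then $X$ together with this entropy function is a data set space.

The solution of the eigen equation $a\psi = z\psi$, with complex $z$, is denoted $\psi_z$ and is called a
coherent state. Note that

$$U_1 = \langle\psi_z|Q\rangle = r\sqrt{2}\,\Re z \quad\text{and}\quad U_2 = \langle\psi_z|P\rangle = \frac{\hbar}{r}\sqrt{2}\,\Im z \tag{65}$$

and

$$\langle\psi_z|a^\dagger a\rangle = |z|^2. \tag{66}$$

Clearly is

$$S(\psi_z) = -\frac{1}{2}|\langle\psi_z|a\rangle|^2 = -\frac{1}{2}|z|^2, \tag{67}$$

and

$$S(\psi) \le -\frac{1}{2}|\langle\psi|a\rangle|^2 \quad\text{for all } \psi \in X \text{ for which } \langle\psi|a\rangle = z. \tag{68}$$

Hence, the coherent states are perfect data sets. In particular, the entropy $S(m)$ of the model
point $m = m_U$ is

$$S(U) = -\frac{1}{4r^2}U_1^2 - \frac{r^2}{4\hbar^2}U_2^2. \tag{69}$$

The Massieu function equals

$$\Phi(\theta) = \sup_U\{S(U) - \theta_1 U_1 - \theta_2 U_2\}. \tag{70}$$

The maximum is reached when

$$\theta_1 = -\frac{1}{2r^2}U_1 \quad\text{and}\quad \theta_2 = -\frac{r^2}{2\hbar^2}U_2. \tag{71}$$

The result is

$$\Phi(\theta) = r^2\theta_1^2 + \frac{\hbar^2}{r^2}\theta_2^2. \tag{72}$$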
It is now straightforward to verify that the $\theta$-parametrization of $\mathbb{M}$ is canonical.
Introduce a logarithmic map $L$ by

$$L(m_U) = -\frac{1}{2}|z|^2 + \frac{1}{2}za^\dagger + \frac{1}{2}\bar{z}a, \tag{73}$$

where $z$ is obtained from (65). There follows immediately that

$$L(m_U) = -\Phi(\theta) - \theta_1 Q - \theta_2 P. \tag{74}$$

This shows that the model belongs to the exponential family. The divergence equals

$$D(\phi\|m_U) = \frac{1}{2}|\langle\phi|a\rangle - z|^2 + \langle\phi|a^\dagger a\rangle - |\langle\phi|a\rangle|^2 \ge 0. \tag{75}$$

In addition, $D(\phi\|\psi_z) = 0$ is equivalent with $z = \langle\phi|a\rangle$ and $a\phi = \langle\phi|a\rangle\phi$. But this implies that $\phi$
equals $\psi_z$, up to a phase factor which can be neglected because it has no physical meaning. Hence,
the divergence vanishes if and only if $\phi$ equals $\psi_z$ up to a constant phase factor.
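
The inequalities of this subsection can be illustrated in a truncated number-state basis. The sketch below is an approximation under stated assumptions: the truncation dimension `N`, the matrix representation of the annihilation operator in the number basis, the sample value of $z$ and all function names are chosen here for illustration and are not part of the original text.

```python
import math
import numpy as np

N = 60                                            # truncation dimension (assumption)
a = np.diag(np.sqrt(np.arange(1, N)), k=1)        # annihilation operator, a|n> = sqrt(n)|n-1>
num = a.conj().T @ a                              # number operator a^dagger a

def expect(psi, A):
    """Evaluation (psi, A psi) for a normalized state vector psi."""
    return np.vdot(psi, A @ psi)

def entropy(psi):
    """Entropy function (1/2)|<psi|a>|^2 - <psi|a^dagger a>."""
    return 0.5 * abs(expect(psi, a)) ** 2 - expect(psi, num).real

def divergence(phi, z):
    """Divergence of the state phi from the model point labelled by z."""
    w = expect(phi, a)
    return 0.5 * abs(w - z) ** 2 + expect(phi, num).real - abs(w) ** 2

def coherent(z):
    """Truncated coherent state with components proportional to z^n / sqrt(n!)."""
    c = np.array([z ** n / math.sqrt(math.factorial(n)) for n in range(N)],
                 dtype=complex)
    return c / np.linalg.norm(c)

z = 0.8 + 0.5j
psi_z = coherent(z)

rng = np.random.default_rng(2)
phi = rng.normal(size=N) + 1j * rng.normal(size=N)
phi *= np.exp(-0.5 * np.arange(N))                # damp high levels of the random state
phi /= np.linalg.norm(phi)
```

For the coherent state the entropy is close to $-|z|^2/2$ and the divergence from its own model point is close to zero, while a generic state has strictly positive divergence.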



6 Conclusions 

The notion of an exponential family of models can be generalized to a context not involving proba- 
bility theory. From the point of view of statistical physics this is of interest because the exponential 
family is at the heart of the discipline and quantum statistical physics involves quantum probabil- 
ity rather than classical probability theory. But the formalism presented here is so general that it 
has many other applications. Only one such example has been elaborated in subsection 5.3. Some 
other applications have been mentioned without proof. These will be taken up in further work. 

By the present effort we hope to contribute to a more general theory of information, includ- 
ing previous extensions in the directions of machine learning, statistical inference and quantum 
information. 